[C++][Parquet] GH-47628: Implement basic parquet file rewriter #47775

HuaHuaY · 2025-10-10T06:58:56Z

This is a draft PR for now. I followed Java's implementation, but I don't think it is a good enough design for C++: we have to copy a lot of code from file_writer.cc or file_reader.cc, which will be troublesome to maintain in the future. I would prefer to implement some classes that inherit from XXXWriter or XXXReader, and I'll think about how to refactor the code. If anyone has good suggestions, please comment.

I have written two kinds of tests, which cover horizontal splicing and vertical splicing of parquet files separately. Only horizontal splicing is implemented so far, because I haven't found an efficient way to merge two parquet files' schemas.

Rationale for this change

Allow parquet files to be rewritten by copying their binary data directly, instead of reading and decoding all values and then writing them again.

What changes are included in this PR?

Add the classes ParquetFileRewriter and RewriterProperties.
Add some to_thrift and SetXXX methods to help copy the metadata.
Add CopyStream methods that memcpy data between ArrowInputStream and ArrowOutputStream.
Add RowGroupMetaDataBuilder::NextColumnChunk(std::unique_ptr<ColumnChunkMetaData> cc_metadata, int64_t shift), which allows adding column metadata without creating a ColumnChunkMetaDataBuilder.
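
To make the raw-copy idea concrete, here is a minimal sketch that uses only the standard library. The helpers `CopyStream`, `ShiftOffset` and `AppendChunk` below are hypothetical stand-ins written for illustration; they are not the signatures added in this PR or anything from the Arrow API. The reading of `shift` as "position of the chunk in the output minus its position in the source" is also an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not the Arrow API): copy up to `nbytes` bytes from
// `in` to `out` in fixed-size chunks without interpreting them.
// Returns the number of bytes actually copied.
inline int64_t CopyStream(std::istream& in, std::ostream& out, int64_t nbytes,
                          int64_t chunk_size = 4096) {
  std::vector<char> buffer(static_cast<std::size_t>(chunk_size));
  int64_t copied = 0;
  while (copied < nbytes) {
    const int64_t want = std::min(chunk_size, nbytes - copied);
    in.read(buffer.data(), static_cast<std::streamsize>(want));
    const std::streamsize got = in.gcount();
    if (got <= 0) break;  // the source ended early
    out.write(buffer.data(), got);
    copied += got;
  }
  return copied;
}

// Assumed meaning of `shift`: new position minus old position of the chunk.
inline int64_t ShiftOffset(int64_t old_offset, int64_t shift) {
  return old_offset + shift;
}

// Append the byte range [old_offset, old_offset + length) of `source` to
// `sink` and return the offset that range has inside `sink`.
inline int64_t AppendChunk(const std::string& source, int64_t old_offset,
                           int64_t length, std::string* sink) {
  std::istringstream in(source);
  in.seekg(old_offset);
  std::ostringstream out;
  const int64_t shift = static_cast<int64_t>(sink->size()) - old_offset;
  const int64_t copied = CopyStream(in, out, length);
  assert(copied == length);
  (void)copied;  // silence unused-variable warnings when NDEBUG is defined
  sink->append(out.str());
  return ShiftOffset(old_offset, shift);
}
```

In this sketch the bytes are never decoded; only the recorded offset changes, which is the part a metadata copy would need to adjust.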

Are these changes tested?

Yes

Are there any user-facing changes?

Adds the new classes and methods mentioned above.
ReaderProperties::GetStream is now a const method. Only the signature has changed; the existing implementation already allowed it to be declared const.
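
Why a signature-only change like this should be harmless can be illustrated with a small standalone sketch. The `Props` class below is hypothetical and unrelated to the real `ReaderProperties`; it only shows that a member function which never modifies its object can be marked `const`, which additionally lets callers holding a `const` reference use it.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical class for illustration only.
class Props {
 public:
  explicit Props(std::string data) : data_(std::move(data)) {}

  // Marked const: it builds a new stream from existing state and does not
  // modify any member of Props.
  std::unique_ptr<std::istringstream> GetStream(std::size_t start,
                                                std::size_t length) const {
    return std::make_unique<std::istringstream>(data_.substr(start, length));
  }

 private:
  std::string data_;
};

// Compiles only because GetStream is const: `props` is a const reference.
inline std::string ReadAll(const Props& props, std::size_t start,
                           std::size_t length) {
  auto stream = props.GetStream(start, length);
  return stream->str();
}
```

Existing callers that hold a non-const object keep compiling unchanged, since a const member function can be called on both const and non-const objects.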

GitHub Issue: [C++][Parquet] Provide a rewriter to rewrite parquet files without decoding all the row groups/pages #47628

github-actions · 2025-10-10T06:59:20Z

Thanks for opening a pull request!

If this is not a minor PR. Could you open an issue for this pull request on GitHub? https://github.com/apache/arrow/issues/new/choose

Opening GitHub issues ahead of time contributes to the Openness of the Apache Arrow project.

Then could you also rename the pull request title in the following format?

GH-${GITHUB_ISSUE_ID}: [${COMPONENT}] ${SUMMARY}

or

MINOR: [${COMPONENT}] ${SUMMARY}


HuaHuaY · 2025-10-13T04:07:14Z

@pitrou @adamreeve @mapleFU Do you have any suggestions about this draft? Is there an efficient way to merge two parquet files' schemas?

mapleFU

Hmm, I think just reusing the current code is an OK approach, since this logic in the current implementation would be a bit hacky with the current interface...

cpp/src/parquet/properties.h

mapleFU · 2025-10-13T04:19:02Z

cpp/src/parquet/page_index.cc

  template <typename Builder>
  void SerializeIndex(
      const std::vector<std::vector<std::unique_ptr<Builder>>>& page_index_builders,
+      const std::vector<std::vector<std::unique_ptr<Index<Builder>>>>& page_indices,


Can this be separated into a different method? This reuse feels a bit hacky to me.
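
One possible shape for such a split, sketched with hypothetical stand-in types (`FakeIndex`, `FakeBuilder`, `Grid` and both function names are invented here and do not come from page_index.cc): one function serializes indices that already exist, and a second one builds indices from builders and then delegates to the first.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Stand-in for an already-built index that can serialize itself.
struct FakeIndex {
  std::string payload;
  void SerializeTo(std::string* sink) const { sink->append(payload); }
};

// Stand-in for a builder that has to produce an index first.
struct FakeBuilder {
  std::string pending;
  std::unique_ptr<FakeIndex> Build() const {
    return std::make_unique<FakeIndex>(FakeIndex{pending});
  }
};

// Outer vector: row groups. Inner vector: columns. Entries may be null.
template <typename T>
using Grid = std::vector<std::vector<std::unique_ptr<T>>>;

// Path 1: serialize indices that already exist (e.g. copied from a file).
inline void SerializeBuiltIndices(const Grid<FakeIndex>& indices,
                                  std::string* sink) {
  for (const auto& row_group : indices) {
    for (const auto& index : row_group) {
      if (index != nullptr) index->SerializeTo(sink);
    }
  }
}

// Path 2: build each index from its builder, then reuse path 1.
inline void SerializeFromBuilders(const Grid<FakeBuilder>& builders,
                                  std::string* sink) {
  Grid<FakeIndex> built;
  for (const auto& row_group : builders) {
    built.emplace_back();
    for (const auto& builder : row_group) {
      std::unique_ptr<FakeIndex> index;
      if (builder != nullptr) index = builder->Build();
      built.back().push_back(std::move(index));
    }
  }
  SerializeBuiltIndices(built, sink);
}
```

With this layout each function takes a single kind of input, so neither caller has to pass an empty or unused argument.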

wgtmac

I haven't reviewed all the changes yet and will progressively post my comments.

cpp/src/parquet/properties.h

cpp/src/parquet/page_index.cc

cpp/src/parquet/page_index.h

cpp/src/parquet/properties.h

cpp/src/parquet/platform.h

cpp/src/parquet/file_rewriter.h

wgtmac

The general workflow of the rewriter looks good to me. However, I don't believe we should directly manipulate the thrift objects.

cpp/src/parquet/file_rewriter.h

wgtmac · 2025-10-16T08:17:44Z